4. Main Analysis - Exploratory Data Analysis
In this section, we will talk about the exploratory data analysis that was done on this data.
As mentioned before, we had a set of questions that we wanted to answer as part of the project. This section is divided into various subsections, each looking at a different perspective of data.
4.1 Introduction
First, Let’s look at some statistics on Air Travel.
Figure 4.1.1: Statistics on Air Travel
df <- year_data
# Total number of flights
# Number of inbound flights per state
so <- c("DestStateName")
ms_df_o <- df %>% select(so) %>% group_by(DestStateName) %>% summarise(n())
names(ms_df_o)[2] <- "outbound"
names(ms_df_o)[1] <- "region"
# Number of outbound flights per state
sd <- c("OriginStateName")
ms_df_d <- df %>% select(sd) %>% group_by(OriginStateName) %>% summarise(n())
names(ms_df_d)[2] <- "inbound"
names(ms_df_d)[1] <- "region"
# Merging the dataframes for inbound and outbound flights
ms_df_tot <- merge(x = ms_df_d,y=ms_df_o,on="region",all=TRUE)
ms_df_tot$value <- ms_df_tot$outbound + ms_df_tot$inbound
names(ms_df_tot)[1] <- "region"
ms_df_tot$region <- tolower(ms_df_tot$region)
# Plotting the choropleth map for total number of flights per state
state_choropleth(ms_df_tot,legend = "Number of flights", num_colors = 9) +
ggtitle("Which states are big on flying?",
subtitle = "Number of Inbound and Outbound flghts per state in 2017")+
labs(caption = "Source: http://www.bts.gov/help/aviation/") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
Suprisingly, as compared to the number of delayed flights and the number of cancelled flights, New York has lesser number of flights than expected and Georgia has a higher number of flight.
4.2: Dominance
Domination of an air route or air traffic is an important factor to find out the way america prefers to fly. While some airlines focus their operations primarily onthe major cities in the United states, there are some which specialise in regional routes by serving the tier 2 and tier 3 cities. In this section, we will try to identify the players that dominate the national airline traffic and which others eye for regional domination.
Figure 4.2.1: Checking Available Airline Options to choose from at Leading Airports
data$Reporting_Airline <- factor(data$MKT_UNIQUE_CARRIER,
levels = c("UA","AS","DL","B6","AA","F9","G4","HA","NK","VX","WN"),
labels = c('United','Alaska','Delta','Jet Blue', 'American', 'Frontier','Allegiant','Hawaiian','Spirit','Virgin Atlantic', 'Southwest'))
data2 <- data %>% select( ORIGIN, DEST, MKT_UNIQUE_CARRIER,FL_DATE, DEP_TIME )
sel1 <- data2 %>% group_by( ORIGIN ) %>% dplyr::summarise( total = n() ) %>% arrange( desc( total ) )
top9airports <- sel1$ORIGIN[1:9]
newdata <- data2[ which(data2$ORIGIN %in% top9airports), ]
newdata$ORIGIN <- factor(newdata$ORIGIN,
levels = c("ATL","ORD","DFW","DEN","LAX","CLT","PHX","SFO","SEA"),
labels = c('Atlanta','Chicago','Dallas Forth Worth','Denver','Los Angles','Charlotte','Phoenix','San Francisco','Seattle'))
grp <- newdata %>%
group_by(ORIGIN, MKT_UNIQUE_CARRIER) %>%
dplyr::summarise(Freq = n())
ggplot(grp, aes(x=MKT_UNIQUE_CARRIER,y= Freq ))+
geom_bar(position = "dodge", stat = "identity") +
facet_wrap(~ORIGIN)+
ggtitle("Can we choose the airline we fly ?",
subtitle = "Total no. of Flights operated by Domestic Carriers originating from 9 Busiest Airports") +
labs(x="Airlines",y = "Daily Flights", caption = "Source: http://www.bts.gov/help/aviation/") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
- We see that some airlines are the only major option at a specific airport.
- For ex: In Atlanta most of the flights are operated by Delta, while in Dallas and Seattle, American Airlines is the biggest operatr.
- In Phoenix, Los Angles and San Francisco there is no single largest airline.
Figure 4.2.2: Exploring patterns in Airline Routes ?
subnewdata <-newdata[ which(newdata$DEST %in% top9airports), ]
subnewdata$DEST <- factor(subnewdata$DEST,
levels = c("ATL","ORD","DFW","DEN","LAX","CLT","PHX","SFO","SEA"),
labels = c('Atlanta','Chicago','Dallas Forth Worth','Denver','Los Angles','Charlotte','Phoenix','San Francisco','Seattle'))
subnewdata <-subnewdata %>% group_by(ORIGIN, DEST) %>%
mutate(Freq = n())
subnewdata <-subnewdata %>% group_by(ORIGIN, DEST) %>%
mutate(Total = sum(Freq)) %>% ungroup()
newdata2 <- subnewdata %>% transform( routes=paste(ORIGIN,'-', DEST))
newdata3 <- newdata2 %>%group_by(routes) %>% dplyr::summarise(total=n())%>% arrange(desc(total))
rts <- c("San Francisco - Los Angles",
"Los Angles - Seattle ",
"Seattle - San Francisco",
"Denver - Los Angles",
"Los Angles - Phoenix",
"Atlanta - Chicago",
"Chicago - Los Angles",
"Phoenix - Denver",
"Los Angles - Dallas Forth Worth",
"Dallas Forth Worth - Chicago")
newdata3 <- newdata2 %>% filter(routes %in% rts)
grp3 <- newdata3 %>%
group_by(routes, MKT_UNIQUE_CARRIER) %>%
dplyr::summarise(Freq = n())
names(grp3)[2] <- 'Airlines'
ggplot(grp3, aes(x= routes, y= Freq, fill= Airlines ))+
geom_bar(position = "dodge", stat = "identity") +
ggtitle("What options do I have?",
subtitle = "Breakdown of Airlines serving the Top 10 Busiest Domestic Air Routes [ January 2018 ]") +
labs(x="Top 10 Busiest Routes",y = "Total No. of Flights", caption = "Source: http://www.bts.gov/help/aviation/") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
scale_color_viridis_d()
- United Airlines serves all the routes.
- San Franciso - Los Angeles is the busiest Domestic Air Route.
- There are region spefic airlines like Southwest which serves only in the Southern States and Alaska Airlines which is presesnt only in Seattle - SF route.
Figure 4.2.3.1: Routes and Unique Routes
- The Graph shows that Southwestern Airline is dominating the scene for the total number of flights operated in United States.
- It is followed by Delta Airlines, American Airlines, SkyWest Airlines and United Airlines.
- Note that this data is for the reporting airlines and not marketing airlines.
- Parts of Skywest Airlines are marketed as American Airlines, Delta Airlines, and others are marketed as United Airlines.
Figure 4.2.3.2: Routes and Unique Routes
- Despite running less than two thirds of number of operations run by Southwestern Airlines, Skywest Airlines runs on most number of unique routes in the United States.
- It can be infered that Skywest is runs operations to many cities of United States than any other airlines.
Figure 4.2.4: Cleveland Plot
top_from <- unique(air_data %>%
select(Origin, Dest)) %>%
group_by(Origin) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
head(50)
ggplot(top_from, aes(x = count, y = reorder(Origin, count))) +
geom_point(color = "blue") +
theme(axis.ticks.y = element_blank(),plot.title = element_text(size = rel(1.2))) +
labs(x = "Number of airports connected", y = "Airport",
title = "Top 20 airports by connectivity to other airports",
subtitle = "O'Hare, Dallas Ford, Atlanta, and Denver are the most connected airports")
- The Top 20 Airports with respect to the number of airports that they are connected to show an interesting picture. Chicago O’Hare International Airport is the most connected airport in the US with flights to 172 different airports. Its followed by Dallas, Atlanta and Denver Airport.
- It is interesting to know that airports located in the West coast are not that well compared as compared to other airports. The most connected Airport on the west coast is at 10 place (Los Angles Internation Airport- LAX)
- Another interesting fact is that despite being the Capital of the World, the 2 aiports in New York City are in the 20th and 22nd place in terms of connectivity.
Figure 4.2.5: Alluvial Plot
exp <- year_data %>% dplyr::filter(Origin== "JFK", DepDelayMinutes > 60)
list1<- unique(exp$Dest)
#list1
#length(list1)
subset1 <- year_data[! is.na(year_data$CancellationCode),] %>% select(Year, Month, FlightDate, Origin, Reporting_Airline, CancellationCode)
subset1$CancellationCode <- as.factor(subset1$CancellationCode)
levels(subset1$CancellationCode) <- c("Carrier","Weather","National Air System","Security")
top10aircodes <- c('JFK', 'ATL', 'SEA', 'SFO', 'LAX', 'PHX','DEN', 'DFW', 'ORD', 'CLT' )
top9airlines <- c('9E','AA','B6','DL','MQ', 'OH', 'OO', 'UA','VX')
top6airlines <- c('9E','AA','B6','DL', 'OO', 'UA')
decset1 <- december_data[! is.na(december_data$CancellationCode), ] %>%
select(Year, Month, DayofMonth, Origin, Reporting_Airline, CancellationCode)
decset1$CancellationCode <- as.factor(decset1$CancellationCode)
levels(decset1$CancellationCode) <- c("Carrier","Weather","National Air System","Security")
decset2 <- decset1 %>% group_by(Year,DayofMonth, CancellationCode) %>%
summarise( count = n() )
library(ggalluvial)
decset3 <- decset1 %>% filter(Origin =='LGA') %>% group_by(Year,DayofMonth, CancellationCode) %>% summarise( count = n() )
demo <- year_data %>% filter( Origin %in% top10aircodes, Dest %in% top10aircodes, Reporting_Airline %in% top6airlines) %>%
group_by( Origin, Dest, Reporting_Airline) %>%
summarise(Freq= n())
demo$Reporting_Airline <- as.factor(demo$Reporting_Airline)
levels(demo$Reporting_Airline) <- c("Endeavour", "American", "JetBlue", "Delta", "Skywest", "United")
ggplot(demo, aes(y = Freq, axis1 = Origin, axis2 = Reporting_Airline, axis3 = Dest)) +
geom_alluvium(aes(fill = Reporting_Airline), width = 1/12) +
geom_stratum(width = 1/12, fill = "white", color = "black") +
geom_label(stat = "stratum", label.strata = TRUE) +
scale_x_discrete(limits = c("Origin Airport", "Airlines","Desination Airport"), expand = c(.05, .05)) +
scale_fill_brewer(type = "qual", palette = "Dark2") +
ggtitle("The Big Three - American, Delta and United.")+
labs(subtitle="Airport Connections visualised with airlines serving different routes",y = "No. of Flights", caption = "Source: bts.transtats.gov") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))+
theme(axis.text.x = element_text(angle = 0))
- We can see a clear domination of three major airlines - United , Delta and United on the nation’s most busiest airports.
- Skywest Airlines is restricted to the West Coast.
- JetBlue only serves specifically on the New York to Los Angeles route.
4.3 Delay
In this subsection we will look at the various attributes that have contributed towards delay. We will also check if there are factors that correlate with delay
Figure 4.3.1: Scatter Plot - Delay vs.Distance
ggplot(exp, aes(y = Distance, x = DepDelayMinutes))+
geom_point(alpha = 0.5)+
ggtitle("Does Flight Distance and Delay correlate? ",
subtitle = "Reasons of Flight Cancelllation of Top 9 Airlines at 10 Busiest Airports") +
labs(y = "Flight Distance ( in Kms )",x= "Departure Delays (in minutes)", caption = "Source: bts.transtats.gov") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
- We see that there no specific correlation between the distance of a flight and the delay it may suffer.
- The Worst delay that a medium and short range flight can suffer from ranges between 500 mins - 1000 mins. That is anywhere between 8.5 Hrs to 16 Hrs.
Figure 4.3.2.1 : Ridgeline Plot ( Time of Delay )
library(tidyverse)
library(ggridges)
december <- december_data
december$FlightDate <- as.Date(december$FlightDate,"%Y-%m-%d")
december_new <- december[december$OriginAirportID == 12478,]
december_new$day <- format(as.Date(december_new$FlightDate,"%Y-%m-%d"),"%d")
december_new$day <- as.integer(december_new$day )
# Filter flights between the dates 21-31 that have delayed departing flights
delayed <- december_new %>%
select("Year","DepDelayMinutes","day") %>%
filter(DepDelayMinutes > 0) %>%
filter(day>=21 & day<=31)
# Plot
ggplot(delayed, aes(x=day,y=Year,group=Year))+
geom_density_ridges(fill="red",alpha=0.4)+
scale_x_continuous(breaks=21:31) +
ggtitle("Planning for christmas holiday trips from JFK?",
subtitle = "Departure delays from JFK in the last 11 days of december every year") +
labs(x="Day of the Month",y = "Year", caption = "Source: http://www.bts.gov/help/aviation/") +
theme(plot.title = element_text(face = "bold",size=20)) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35",size=15)) +
theme(plot.caption = element_text(color = "grey68"))+
theme(axis.text.x = element_text(size=12)) +
theme(axis.text.y = element_text(size=12)) +
theme(axis.title.x = element_text(size=17)) +
theme(axis.title.y = element_text(size=17))
- The above plot presents information on the volume of departing flights that get delayed in JFK in the holiday season (last 11 days in the month of december).
- Major observation from the above graph is that there are delays invariably on all the days between 21st and 31st decemeber.
- Most delays occured on 22nd and 23rd of december consistently in all 5 years, while it uniformly drops a little on 25th.
- Surprisingly, breaking the trend, the volume of delays in 2017 on the 30th of December has been very high.
Figure 4.3.2.2 : Ridgeline Plot ( Time of Delay )
december$FlightDate <- as.Date(december$FlightDate,"%Y-%m-%d")
december_new <- december[december$OriginAirportID == 12953,]
december_new$day <- format(as.Date(december_new$FlightDate,"%Y-%m-%d"),"%d")
december_new$day <- as.integer(december_new$day )
# Filter flights between the dates 21-31 that have delayed departing flights
delayed <- december_new %>%
select("Year","DepDelayMinutes","day") %>%
filter(DepDelayMinutes > 0) %>%
filter(day>=21 & day<=31)
# Plot
ggplot(delayed, aes(x=day,y=Year,group=Year))+
geom_density_ridges(fill="red",alpha=0.4)+scale_x_continuous(breaks=21:31) +
ggtitle("Planning for christmas holiday trips from LGA?",
subtitle = "Departure delays from LGA in the last 11 days of december every year") +
labs(x="Day of the Month",y = "Year", caption = "Source: http://www.bts.gov/help/aviation/") +
theme(plot.title = element_text(face = "bold",size=20)) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35",size=15)) +
theme(plot.caption = element_text(color = "grey68"))+
theme(axis.text.x = element_text(size=12)) +
theme(axis.text.y = element_text(size=12)) +
theme(axis.title.x = element_text(size=17)) +
theme(axis.title.y = element_text(size=17))
- The same pattern seen for JFK is observed here. The large number of delays in these days may be explained by bad weather conditions.
- Number of Flights delayed during different times of day for 7 days of the week
Figure 4.3.2.3 : Ridgeline Plot ( Time of Delay )
airline_data = read_csv(file = "data/jan2018.csv", col_names = TRUE)
dep_delay <- airline_data %>%
select("CRS_DEP_TIME","DEP_DELAY_NEW","DAY_OF_WEEK") %>%
filter(DEP_DELAY_NEW > 0)
dep_delay$DAY_OF_WEEK <- factor(dep_delay$DAY_OF_WEEK)
dep_delay$DAY_OF_WEEK <- recode(dep_delay$DAY_OF_WEEK,
"1" = "Monday" ,"2" = "Tuesday","3" = "Wednesday","4" = "Thursday","5" = "Friday",
"6" = "Saturday","7" = "Sunday")
d1 <- ggplot(dep_delay, aes(x = as.numeric(CRS_DEP_TIME), y = DAY_OF_WEEK,
group = DAY_OF_WEEK)) +
geom_density_ridges(fill = "blue", alpha = .4) +
scale_x_continuous(breaks=c(0000,0600,1200,1800, 2400)) +
labs(x = "24 hour time", y = "Day of the week",
title = "Departure delays during the day with respect to day of the week")
arr_delay <- airline_data %>%
select("CRS_ARR_TIME","ARR_DELAY_NEW","DAY_OF_WEEK") %>%
filter(ARR_DELAY_NEW > 0)
arr_delay$DAY_OF_WEEK <- factor(arr_delay$DAY_OF_WEEK)
arr_delay$DAY_OF_WEEK <- recode(arr_delay$DAY_OF_WEEK,
"1" = "Monday" ,"2" = "Tuesday","3" = "Wednesday","4" = "Thursday","5" = "Friday",
"6" = "Saturday","7" = "Sunday")
d2 <- ggplot(arr_delay, aes(x = as.numeric(CRS_ARR_TIME), y = DAY_OF_WEEK,
group = DAY_OF_WEEK)) +
geom_density_ridges(fill = "green", alpha = .4) +
scale_x_continuous(breaks=c(0000,0600,1200,1800, 2400)) +
labs(x = "24 hour time", y = "Day of the week",
title = "Arrival delays during the day with respect to day of the week")
grid.arrange(d1, d2, nrow = 1)
- The trend in the departure shows that that there are no delays in flights from 0 to 5 AM. This may be because of less frequency of flights scheduled at that moment.
- The departure delays seem to at its highest around 3 PM and 7 PM.
- The Arrival delays seem to be at its highesr at around 5 PM to 9 PM, which is logical as the flights that had a delayed departure seem to arrive later than scheduled.
- There also seems to be small hike in arrival delays during the midnight. This also coloborates our previous logic that flights having a delayed depearture have delayed arrival as well.
Figure 4.3.3: Choropleth - Deay Minutes per state ( Anirudh )
# Statewise delays
df_statewise_delay_arr <- group_by(df,Reporting_Airline,OriginStateName) %>% summarise_at(vars(ArrDelayMinutes), funs(sum(., na.rm=TRUE)))
df_statewise_delay_dep <- group_by(df,Reporting_Airline,DestStateName) %>% summarise_at(vars(DepDelayMinutes), funs(sum(., na.rm=TRUE)))
names(df_statewise_delay_arr)[2] <- "state"
names(df_statewise_delay_dep)[2] <- "state"
names(df_statewise_delay_arr)[3] <- "value"
names(df_statewise_delay_dep)[3] <- "value"
# Count the number of delay minutes per state for Outbound Flights
plt_new_dep <- df_statewise_delay_dep %>% group_by(state) %>% summarise(value = sum(value))
plt_new_dep$region <- tolower(plt_new_dep$state)
# Count the number of delay minutes per state for Inbound Flights
plt_new_arr <- df_statewise_delay_arr %>% group_by(state) %>% summarise(value = sum(value))
plt_new_arr$region <- tolower(plt_new_arr$state)
names(plt_new_arr)[2] <- "Inbound"
names(plt_new_dep)[2] <- "Outbound"
# Merge the dataframes for both Inbound and Outbound Flights
df_del <- merge(x = plt_new_dep,y=plt_new_arr,on="region",all=TRUE)
df_del$value <- df_del$Outbound + df_del$Inbound
# Plotting the cummulative delay minutes per state
state_choropleth(df_del,legend = "Delay Minutes", num_colors = 9) +
ggtitle("Which states are notorious for delays?", subtitle = "Number of Delay minutes per state in 2017")+
labs(caption = "Source: http://www.bts.gov/help/aviation/") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
Figure 4.3.4.1 : Choropleth - Delay with Seasons [October - December’2017] ( Akhil Punia )
dat3 <- year_data
dat3[is.na(dat3)] <- 0
states <- rownames(state.x77)
dat2 <- dat3 %>% filter(Month %in% c(10,11,12)) %>% group_by(OriginStateName) %>% summarize(value = mean(DepDelay) )
dat4 <- dat2[dat2$OriginStateName %in% states,]
dat4 <- rbind(dat4, c("Delaware", 0))
dat4 <- rbind(dat4, c("district of columbia", 0))
dat4$OriginStateName <- tolower(dat4$OriginStateName)
colnames(dat4) <- c("region","value")
dat4$value <- sapply(as.numeric(dat4$value), round,2)
state_choropleth(dat4,legend = "Delay Time Scales")+
ggtitle("Average Flight Delays | Period: October - December 2017",
subtitle = "Average delay times assocaited with different states") +
labs(y = "No. of Flights", caption = "Source: bts.transtats.gov") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
Figure 4.3.4.2: Choropleth - Delay with Seasons [January - March’2018] ( Akhil Punia )
dat2 <- dat3 %>% filter(Month %in% c(1,2,3)) %>% group_by(OriginStateName) %>% summarize(value = mean(DepDelay) )
dat4 <- dat2[dat2$OriginStateName %in% states,]
dat4 <- rbind(dat4, c("Delaware", 0))
dat4 <- rbind(dat4, c("district of columbia", 0))
dat4$OriginStateName <- tolower(dat4$OriginStateName)
colnames(dat4) <- c("region","value")
dat4$value <- sapply(as.numeric(dat4$value), round,2)
state_choropleth(dat4,legend = "Delay Time Scales")+
ggtitle("Average Flight Delays | Period: January - March 2018",
subtitle = "Average delay times assocaited with different states") +
labs(y = "No. of Flights", caption = "Source: bts.transtats.gov") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
Figure 4.3.4.3: Choropleth - Delay with Seasons [April - June’2017] ( Akhil Punia )
dat2 <- dat3 %>% filter(Month %in% c(4,5,6)) %>% group_by(OriginStateName) %>% summarize(value = mean(DepDelay) )
dat4 <- dat2[dat2$OriginStateName %in% states,]
dat4 <- rbind(dat4, c("Delaware", 0))
dat4 <- rbind(dat4, c("district of columbia", 0))
dat4$OriginStateName <- tolower(dat4$OriginStateName)
colnames(dat4) <- c("region","value")
dat4$value <- sapply(as.numeric(dat4$value), round,2)
state_choropleth(dat4,legend = "Delay Time Scales")+
ggtitle("Average Flight Delays | Period: April - June 2018",
subtitle = "Average delay times assocaited with different states") +
labs(y = "No. of Flights", caption = "Source: bts.transtats.gov") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
### Observations:
Figure 4.3.4.4: Choropleth - Delay with Seasons [July - August’2017] ( Akhil Punia )
dat2 <- dat3 %>% filter(Month %in% c(7,8,9)) %>% group_by(OriginStateName) %>% summarize(value = mean(DepDelay) )
dat4 <- dat2[dat2$OriginStateName %in% states,]
dat4 <- rbind(dat4, c("Delaware", 0))
dat4 <- rbind(dat4, c("district of columbia", 0))
dat4$OriginStateName <- tolower(dat4$OriginStateName)
colnames(dat4) <- c("region","value")
dat4$value <- sapply(as.numeric(dat4$value), round,2)
state_choropleth(dat4,legend = "Delay Time Scales")+
ggtitle("Average Flight Delays | Period: July - September 2018",
subtitle = "Average delay times assocaited with different states") +
labs(y = "No. of Flights", caption = "Source: bts.transtats.gov") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
### Observations:
Figure 4.3.5.1 : Bar Chart: Count of Delay/On time/Before Time for [ National ]
# Plot of Delay
df$dep_cat[df$DepDelay >15] <- 1
df$dep_cat[df$DepDelay <0] <- -1
df$dep_cat[df$DepDelay <=15 & df$DepDelay >=0] <- 0
df$arr_cat[df$ArrDelay >15] <- 1
df$arr_cat[df$ArrDelay <0] <- -1
df$arr_cat[df$ArrDelay <=15 & df$ArrDelay >=0] <- 0
# Chart 1 for departing/outbound flights for National
df_dvg1 <- df %>% group_by(Month,dep_cat) %>% summarize(n())
names(df_dvg1)[3] <- "count"
df_dvg1 <-df_dvg1 %>% drop_na()
df_dvg1$dep_cat <- as.factor(df_dvg1$dep_cat)
df_dvg1$Month <- as.factor(df_dvg1$Month)
df_dvg1$Status[df_dvg1$dep_cat==0] <- "On Time"
df_dvg1$Status[df_dvg1$dep_cat==1] <- "Delayed"
df_dvg1$Status[df_dvg1$dep_cat==-1] <- "Before Time"
ch1 <- ggplot(df_dvg1,aes(x=Month,y=count,fill=Status)) + geom_bar(stat = "Identity",position="dodge") +
ylab("Frequency") +
theme_gray(16) +
labs(title="What is the performance of Outbound flights?",
caption = "Source: http://www.bts.gov/help/aviation/",
subtitle = "Outbound Delay per month") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
#ggsave("Outbound Delay per month.png")
# Chart 2 for arriving/inbound flight for National
df_dvg2 <- df %>% group_by(Month,arr_cat) %>% summarize(n())
names(df_dvg2)[3] <- "count"
df_dvg2 <-df_dvg2 %>% drop_na()
df_dvg2$arr_cat <- as.factor(df_dvg2$arr_cat)
df_dvg2$Month <- as.factor(df_dvg2$Month)
df_dvg2$Status[df_dvg2$arr_cat==0] <- "On Time"
df_dvg2$Status[df_dvg2$arr_cat==1] <- "Delayed"
df_dvg2$Status[df_dvg2$arr_cat==-1] <- "Before Time"
ch2 <- ggplot(df_dvg2,aes(x=Month,y=count,fill=Status)) + geom_bar(stat = "Identity",position="dodge") +
ylab("Frequency") +
theme_gray(16) +
labs(title="And Inbound flights?",
caption = "Source: http://www.bts.gov/help/aviation/",
subtitle = "Inbound Delay per month") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
grid.arrange(ch1, ch2, nrow = 1)
### Observations: There is a reductions in the number of flights opereated during winters. The number of delayed flights decrease substantially in the month of october and peak around the summer months suprisingly.
4.4: Cancellations
Now lets look at what each passenger wanting to go home dreads. Cancellation! We will look at where do most of the flights get cancelled and the main reasons for this cancellation.
Figure 4.4.1 : Reasons of Flight Cancelllation of Top 9 Airlines at 10 Busiest Airports
subset2 <- subset1 %>% filter(Origin %in% top10aircodes, Reporting_Airline %in% top9airlines)
subset2$Reporting_Airline <- as.factor(subset2$Reporting_Airline)
levels(subset2$Reporting_Airline) <- c("Endeavour", "American", "JetBlue", "Delta", "Envoy", "PSA Airline", "Skywest", "United", "Virgin")
ggplot(subset2, aes(x= Origin, fill=CancellationCode))+
geom_bar(position = "dodge")+
facet_wrap(~Reporting_Airline)+
ggtitle("Why do flights get cancelled?",
subtitle = "Reasons of Flight Cancelllation of Top 9 Airlines at 10 Busiest Airports") +
labs(y = "No. of Flights", x = "Origin Airports", caption = "Source: bts.transtats.gov") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Figure 4.4.2 : Reasons of Flight Cancelllation of Top 3 Airlines at 10 Busiest Airports [ Executive Summary ]
subset4 <- subset1 %>% group_by(FlightDate, CancellationCode) %>%
summarise( count = n() )
ggplot(subset4, aes(x=FlightDate, y = count))+
geom_line() +
facet_wrap(~CancellationCode)+
ggtitle("Weather wrecks Air Travel ?",
subtitle = "Flight Cancellation Patterns over the Last year") +
labs(y = "No. of Flights", caption = "Source: bts.transtats.gov") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Figure 4.4.3 : Flight cancellations across United Sates in last 5 Decembers
ggplot(decset2, aes(x=DayofMonth, y = count, color= CancellationCode))+
geom_line() +
facet_wrap(~Year)+
ggtitle("Winter kills Holidays, year after year!",
subtitle = "Flight cancellations across United Sates in last 5 Decembers") +
labs(y = "No. of Flights", caption = "Source: bts.transtats.gov") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Figure 4.4.4.1 : Heatmap Cancellations -Origin ( Airports vs. Day of Week )
library(viridis)
airlines <- year_data
air <- airports
# Top 10 originating airports based on number of originating flights
new_air_origin <- airlines %>%
group_by(Origin) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
head(10)
# Filter originating flights from that dataset that are cancelled
cancelled_org <- airlines %>%
select(Origin,Cancelled,DayOfWeek) %>%
filter(Cancelled > 0) %>%
group_by(Origin,DayOfWeek) %>%
summarise(number_cancelled = n())
# Filter rows that belong to top 10 originating airports
new_can_origin <- cancelled_org[cancelled_org$Origin %in% new_air_origin$Origin,]
# Select the rows needed to figure out airport name from air dataset
air_updated <- air %>% select(airport,iata)
# Find airport names for the respective origins using IATA from air dataset
cancelled_origin <- left_join(new_can_origin,air_updated,by=c("Origin" = "iata"))
cancelled_origin$DayOfWeek <- as.factor(cancelled_origin$DayOfWeek)
# Recode the day number to day names
cancelled_origin$DayOfWeek <- recode(cancelled_origin$DayOfWeek, "1"="Monday", "2"="Tuesday", "3"="Wednesday", "4"="Thursday", "5"="Friday", "6"="Saturday", "7"="Sunday")
# Plot the heatmap
ggplot(cancelled_origin,aes(y=airport,x=DayOfWeek,fill=number_cancelled)) +
geom_tile(color="white") +
ggtitle("Planning for a smooth travel without cancellations?",
subtitle = "Number of departing flight cancellations in top 10 airports on each day") +
labs(x="Day of the Week",y = "Airport", caption = "Source: http://www.bts.gov/help/aviation/") +
theme(plot.title = element_text(face = "bold",size=20)) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35",size=15)) +
theme(plot.caption = element_text(color = "grey68"))+
theme(axis.text.x = element_text(angle = 45, hjust = 1,size=12)) +
theme(axis.text.y = element_text(size=12)) +
theme(axis.title.x = element_text(size=17)) +
theme(axis.title.y = element_text(size=17)) +
scale_fill_viridis()
It can be seen from the above heatmap that there are over 300-400 departing flights cancelled in 2018 in the following airports. Please note that this represents cumulative cancellations in the whole year on a particular day of the week.
- George Bush International on Tuesday
- Charlotte/Douglas International on Wednesday
- William B Hartsfield-Atlanta Intl on Wednesday
- LaGuardia on Thursday
This map helps travellers to be aware of which days in a week more number of flights are getting cancelled so that it helps them plan their travel accordingly. On all other days, the number of outgoing flights cancelled are pretty less (<100) in all the airports.
Figure 4.4.5.2 : Heatmap Cancellations - Destination ( Airports vs. Day of Week )
# Top 10 destination airports
new_air_dest <- airlines %>%
group_by(Dest) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
head(10)
# Filter all the destination airports that have cancellations
cancelled_dest <- airlines %>%
select(Dest,Cancelled,DayOfWeek) %>%
filter(Cancelled > 0) %>%
group_by(Dest,DayOfWeek) %>%
summarise(number_cancelled = n())
# Filter top 10 airports from the cancelled ones
new_can_dest <- cancelled_dest[cancelled_dest$Dest %in% new_air_dest$Dest,]
air_updated <- air %>% select(airport,iata)
# Find their airport names
cancelled_dest <- left_join(new_can_dest,air_updated,by=c("Dest" = "iata"))
cancelled_dest$DayOfWeek <- as.factor(cancelled_dest$DayOfWeek)
# Recode days of the week
cancelled_dest$DayOfWeek <- recode(cancelled_dest$DayOfWeek, "1"="Monday", "2"="Tuesday", "3"="Wednesday", "4"="Thursday", "5"="Friday", "6"="Saturday", "7"="Sunday")
ggplot(cancelled_dest,aes(y=airport,x=DayOfWeek,fill=number_cancelled)) +
geom_tile(color="white") +
ggtitle("Planning for a smooth travel without cancellations?",
subtitle = "Number of arriving flights cancelled in top 10 airports on each day") +
labs(x="Day of the Week",y = "Airport", caption = "Source: http://www.bts.gov/help/aviation/") +
theme(plot.title = element_text(face = "bold",size=20)) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35",size=15)) +
theme(plot.caption = element_text(color = "grey68"))+
theme(axis.text.x = element_text(angle = 45, hjust = 1,size=12)) +
theme(axis.text.y = element_text(size=12)) +
theme(axis.title.x = element_text(size=17)) +
theme(axis.title.y = element_text(size=17)) +
scale_fill_viridis()
It can be seen from the above heatmap that there are over 300-400 arriving flights cancelled in 2018 in the following airports. Please note that this represents cumulative cancellations in the whole year on a particular day of the week.
- George Bush Intercontinental on Tuesday
- Charlotte/Douglas International on Wednesday and Thursday
- William B Hartsfield-Atlanta Intl on Wednesday
- LaGuardia on Thursday
Figure 4.4.6 : Choropleth: Number of Cancellations per state (both inbound + outbound)
# Counting the number of cancellations per state for both Inbound and Outbound flights
df_statewise_can_arr <- group_by(df,Reporting_Airline,OriginStateName) %>% summarise_at(vars(Cancelled), funs(sum(., na.rm=TRUE))) %>% group_by(OriginStateName) %>% summarise(value = sum(Cancelled))
df_statewise_can_dep <- group_by(df,Reporting_Airline,DestStateName) %>% summarise_at(vars(Cancelled), funs(sum(., na.rm=TRUE))) %>% group_by(DestStateName) %>% summarise(value = sum(Cancelled))
names(df_statewise_can_arr)[1] <- "region"
names(df_statewise_can_arr)[2] <- "In_can"
names(df_statewise_can_dep)[1] <- "region"
names(df_statewise_can_dep)[2] <- "Out_can"
# Merging the dataframes for Inbound and Outbound flights
df_can <- merge(x = df_statewise_can_arr,y=df_statewise_can_dep,on="region",all=TRUE)
df_can$value <- df_can$In_can + df_can$Out_can
df_can$region <- tolower(df_can$region)
names(ms_df_tot)[4] <- 'num' # number of flights
names(df_del)[5] <- 'del' # Number of Delays
names(df_can)[4]<- "can" # Number of cancelations
# Calulcating the average delay per flight per state and the percentage of cancellations
df_ratio <- merge(x = ms_df_tot,y=df_del,by="region",all=TRUE) %>%merge(y=df_can,by="region",all=TRUE)
df_ratio$del_ratio <- df_ratio$del / df_ratio$num
df_ratio$can_per <- df_ratio$can *100 / df_ratio$num
del_rat <- c("region","del_ratio")
del_rat_df <- df_ratio %>% subset(select = del_rat)
names(del_rat_df)[2] <- "value"
per_can <- c("region","can_per")
can_per_df <- df_ratio %>% subset(select = per_can)
names(can_per_df)[2] <- "value"
# Plotting the percenatage of cancellations per state
state_choropleth(can_per_df,legend = "Percentage of Cancellations", num_colors = 9) +
ggtitle("Which states have the highest flight cancellation percentages?",
subtitle = "Percetange of cancellations for diffrent state in 2017")+
labs(caption = "Source: http://www.bts.gov/help/aviation/") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
- A similar pattern for the percentage of cancellations occurs when comparing with the ratio of delays.
- Florida has a rather higher delay statirstic, but the number of cancellations is pretty low.
5: Distance
In this small subsection, we wil look at what distances are covered by the airlines.
Figure 5.5.1 : BoxPlot ( Flight Time with Distance Group )
airlines1 <- airlines
airlines1$DistanceGroup = as.factor(airlines$DistanceGroup)
airlines1$DistanceGroup <- recode(airlines1$DistanceGroup, "1"="0-250", "2"="250-500", "3"="500-750", "4"="750-1000", "5"="1000-1250", "6"="1250-1500", "7"="1500-1750", "8"="1750-2000", "9"="2000-2250", "10"="2250-2500", "11"="2500-2750")
ggplot()+
geom_boxplot(mapping=aes(x=DistanceGroup,y=ActualElapsedTime),data=airlines1)+
ggtitle("Flight duration as a function of distance") +
labs(x="Distance range in miles",y = "Flight duration in minutes", caption = "Source:http://www.bts.gov/help/aviation/")+
theme(plot.title = element_text(face = "bold",size=20)) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35",size=15)) +
theme(plot.caption = element_text(color = "grey68"))+
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=12)) +
theme(axis.text.y = element_text(size=12)) +
theme(axis.title.x = element_text(size=17)) +
theme(axis.title.y = element_text(size=17))
The above box plot shows the distribution of flight duration for each of the distance groups.
- It can be seen that there are a lot of outliers in each case. For example, consider the outliers of distance group 1.
- They match the average duration taken by flights in distance group 5 and 6.
- This raises curiosity in the observer as to what might be happening with each of those flights that are outliers.
- It can be perceived as the kind of route each of those flights take to cover the distance.
- It can also be associated with the type of aircraft that is flown. Old models tend to fly with less speed. Let us analyse more on this.
Figure 5.5.2 : Mosaic Plot - Distance Group vs Airline
# mosaic of distance vs airline
ms <- c("DistanceGroup","Reporting_Airline")
ms_df <- df %>% select(ms) %>% group_by(Reporting_Airline,DistanceGroup)
# Reordering the airlines on the basis of number of flights
ms_df$Reporting_Airline <- as.factor(ms_df$Reporting_Airline)
ms_df$Reporting_Airline <- reorder(ms_df$Reporting_Airline,x=ms_df$Reporting_Airline,FUN = function(x) - length(x))
# Plotting the mosaic plot of airlines vs reporting group
ggplot(ms_df) + geom_mosaic(aes(x=product(DistanceGroup,Reporting_Airline)),fill="Black") + xlab("Airline") + ylab("Distance Group") + theme_gray(16) +
labs(title="How far do the airline carriers fly?",
caption = "Source: http://www.bts.gov/help/aviation/",
subtitle = "Mosaic plot of Airline vs Reporting Group") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
- Some airline such as Hawai Airlines have no middle range flights, but rather services intra-hawai flights and flights to continental USA.
- Some carriers such as UA, WN,YX, EV, MQ, OH etc. have lesser number of flights servicing long distance flights.
- OO ( or SkyWest Airlines ) primarily services small distance flights.